R Markdown in Practice

Using R markdown

Objectives

This section provides advanced exercises to improve your understanding of R Markdown.

Outline

  1. Setup
  2. Parameters
  3. Caching data
  4. Dependent chunks
  5. Figure size
  6. Figure location
  7. Interactive figures

Setup

Before you get started, you’ll want to create the project folder using the goodenuffR package.

First, install and load the remotes package.

install.packages("remotes")

Then, use remotes::install_github() to download the latest version from GitHub.

# install the package 
remotes::install_github("mjfrigaard/goodenuffR")
Using github PAT from envvar GITHUB_PAT
Downloading GitHub repo mjfrigaard/goodenuffR@HEAD
✔  checking for file ‘/private/var/folders/0x/x5wkbhmx0k74tncn9swz7xpr0000gn/T/
  RtmpMEKbyF/remotesf34816e089a0/mjfrigaard-goodenuffR-a05b005/DESCRIPTION’ ...
─  preparing ‘goodenuffR’:
✔  checking DESCRIPTION meta-information ...
─  checking for LF line-endings in source and make files and shell scripts
─  checking for empty or unneeded directories
─  building ‘goodenuffR_0.1.0.tar.gz’
   
* installing *source* package ‘goodenuffR’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
*** copying figures
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (goodenuffR)

*after installing the package, load it with library(goodenuffR):

# load the package
library(goodenuffR)

We’ll assume I have the folder for this course on my Dropbox folder named FA22-JOUR301, and a sub-folder inside for homework assignments (hw).

Below is an example of what this folder should look like after completing the previous homework assignment:

FA22-JOUR301 # course folder 
  └── hw/ # sub-folder
      └── wk_2_goodenuff_r_assignment/ # week 2 homework files
        ├── 2022-09-01-mfrigaard-wk2-intro-to-goodenuffR.Rmd
        ├── CITATION
        ├── LICENSE
        ├── README.Rmd
        ├── README.md
        ├── code
           ├── 01-import.R
           ├── 02-tidy.R
           ├── 03-wrangle.R
           ├── 04-visualize.R
           ├── 05-model.R
           ├── 06-communicate.R
           └── runall.R
        ├── data
           ├── README.md
           └── raw
        ├── docs
           ├── changelog.txt
           ├── manuscript.Rmd
           └── notebook.Rmd
        ├── requirements.txt
        ├── results
           ├── figures
           ├── manuscript
           └── tables
        └── wk_2_goodenuff_r_assignment.Rproj

First I want to find the course folder using dir()

Then, to create the project folder for week 3, I should enter a name in the project_name argument, and the path to the course folder from above.

The commands should look something like this:

# enter the name of the project and parent folder 
goodenuffR::goodenuff_project(
  project_name = "wk 3 rmarkdown in practice", 
  folder_path = "/Users/mjfrigaard/Dropbox/FA22-JOUR301/hw/")

A new RStudio project should open (see example below)

A new RStudio session should open and have only the .Rproj file in the Files pane:

After your project folder is created, run the goodenuffR::goodenuff_files() to create the project files.

Download the assignment (.Rmd)

After creating your project, head over to the course website and download the .Rmd file for this assignment.

Save the file in the folder you create above.

Click on the file to open it.

Follow the instructions in the top of the .Rmd file.

Resources

Parameters

YAML parameters give us the ability to add variables we can later refer to in our document. We will add some parameters to our report to see how these can be used.

Downloading the data

Before downloading the data file, we should think about where we’re going to store it. Recall we always want to store new datasets in the data/raw folder.

Add the following code to the bottom of the YAML header (like so):

title: "Week 3: R Markdown in Practice"
author: "<Your Name>"
output: html_document

params:
  data_dir: !r file.path("data/raw/starwars.rds")
  list_vars: !r c("films", "vehicles", "starships")

These parameters will give us global control over the data we will be importing (even if that file changes in the future).

The first code chunk in assignment 3 is below:

```{r download.file, eval=FALSE}
# change eval to TRUE
download.file(
  url = "https://github.com/mjfrigaard/FA22-JOUR301/raw/main/data/starwars.rds", 
  destfile = "data/raw/starwars.rds")
```

Note that destfile in the download.file() function is identical to the location we passed to file.path() in our params above.

When you set eval to TRUE in this code chunk, the code will download the data file and place it in the data/raw/ folder.

Importing and Viewing Data

It’s hard to do any analyses without data! We will load a toy dataset from the Star Wars API. Add the code below to your .Rmd file to import the StarWars data. We will also name the code chunk StarWars, because it’s the object this code creates.

```{r StarWars}
StarWars <- readr::read_rds(file = params$data_dir)
```

Note that we’ve loaded these data using the parameters we’ve defined above.

Help files

Details about the variables in the StarWars dataset are accessible in RStudio’s help files, which we can access using ??starwars

```{r StarWars-help}
??starwars
```

When we read the help file, we find there are three variables that are lists: films, vehicles, and starships. We have list-columns because the Star Wars API exports data as a JSON file, which is not tabular (like a spreadsheet).

glimpse()

We can see a basic transposed display of the StarWars data with dplyr’s glimpse() function.

dplyr::glimpse(StarWars)
Rows: 87
Columns: 14
$ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader"…
$ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 1…
$ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0,…
$ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", …
$ skin_color <chr> "fair", "gold", "white, blue", "white", "light", …
$ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue…
$ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24…
$ sex        <chr> "male", "none", "none", "male", "female", "male",…
$ gender     <chr> "masculine", "masculine", "masculine", "masculine…
$ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Ald…
$ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Hum…
$ films      <list> <"The Empire Strikes Back", "Revenge of the Sith…
$ vehicles   <list> <"Snowspeeder", "Imperial Speeder Bike">, <>, <>…
$ starships  <list> <"X-wing", "Imperial shuttle">, <>, <>, "TIE Adv…

glimpse() shows us the format and first few values of each variable in StarWars.

skim()

Below is a skimr::skim() view of the StarWars data. We can see each variable broken down by type, along with some summary information.

skimr::skim(StarWars)
Table 1: Data summary
Name StarWars
Number of rows 87
Number of columns 14
_______________________
Column type frequency:
character 8
list 3
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1.00 3 21 0 87 0
hair_color 5 0.94 4 13 0 12 0
skin_color 0 1.00 3 19 0 31 0
eye_color 0 1.00 3 13 0 15 0
sex 4 0.95 4 14 0 4 0
gender 4 0.95 8 9 0 2 0
homeworld 10 0.89 4 14 0 48 0
species 4 0.95 3 14 0 37 0

Variable type: list

skim_variable n_missing complete_rate n_unique min_length max_length
films 0 1 24 1 7
vehicles 0 1 11 0 2
starships 0 1 17 0 5

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
height 6 0.93 174.36 34.77 66 167.0 180 191.0 264 ▁▁▇▅▁
mass 28 0.68 97.31 169.46 15 55.6 79 84.5 1358 ▇▁▁▁▁
birth_year 44 0.49 87.57 154.69 8 35.0 52 72.0 896 ▇▁▁▁▁

The skimr package is great for looking at large data summaries. Read more here.

jsonedit()

If you have JSON or lists (non-rectangular data) in R, sometimes these objects can be hard to visualize. The jsonedit() function from listviewer makes this easier by giving us an interactive display to click-through.

library(listviewer)
listviewer::jsonedit(listdata = StarWars, mode = "view")

Caching data

When we are analyzing large datasets that take awhile to load, it might make sense to cache the data when it’s loaded into a code chunk.

We can do this by including cache=TRUE in the previous StarWars code chunk.

```{r StarWars, cache=TRUE}
StarWars <- readr::read_rds(file = params$data_dir)
```

Re-knit this chunk with the new cache option.

Dataset size

We can determine the size of our dataset using object.size() from the utils package (which is loaded by default).

object.size(StarWars)
57520 bytes

Another option is using the inspect_mem() function from the inspectdf package.

library(inspectdf)
inspectdf::inspect_mem(df1 = StarWars) %>%
  inspectdf::show_plot(text_labels = TRUE,
                       col_palette = 1)

We can see from the data visualization that the list variables are accounting for most of the memory.

Caching data

Including the cache=TRUE option stores the StarWars data, so that R holds the data in memory until the StarWars import chunk is changed. Sometimes we will only want to analyze a subset of a dataset, so it makes sense to cache the larger dataset import chunk.

```{r StarWars, cache=TRUE}
StarWars <- readr::read_rds(file = params$data_dir)
```

With the StarWars data cached, we can remove the list variables from StarWars and create a StarWarsSmall dataset. We saved the names of the list-columns in params$list_vars.

```{r StarWarsSmall}
StarWarsSmall <- StarWars %>% dplyr::select(-c(params$list_vars))
```

Lets check the size of the new StarWarsSmall data by comparing it to the original StarWars dataset. This code chunk should look like this:

```{r inspect_mem-StarWars-StarWarsSmall}
inspectdf::inspect_mem(df1 = StarWars, df2 = StarWarsSmall) %>%
  inspectdf::show_plot(text_labels = TRUE, col_palette = 1)
```

Cache paths

When we cache data, a new folder named your-file-name + _cache/html/ is created in the same directory as our R Markdown file.

We can see the rmarkdown-reports_cache/html/ folder contents below:

rmarkdown-in-practice_cache/html/
├── StarWars_d2dbabfa9304cbd769d2b1eb26c5ff59.RData
├── StarWars_d2dbabfa9304cbd769d2b1eb26c5ff59.rdb
├── StarWars_d2dbabfa9304cbd769d2b1eb26c5ff59.rdx
└── __packages

We can change the location of the data cache by specifying cache.path either in the code chunk, or in the setup chunk.

```{r StarWars, cache=TRUE, cache.path='data/'}
StarWars <- readr::read_rds(file = params$data_dir)
```

Note: you will need to make sure the cache.path folder exists, which can be solved by adding dir.create() in a code chunk above the StarWars chunk. I like using fs::dir_create(), because it checks to see if a folder exists, then creates one if it doesn’t.

If we want to add cache options to the setup chunk, it would look like this,

```{r setup, include=FALSE}
# create data folder
fs::dir_create(path = "data/")
# set chunk options
knitr::opts_chunk$set(cache = TRUE,
                      cache.path = "data/")
```

Dependent chunks

Data analysis and exploration typically moves along in a (somewhat) linear fashion, which means our code chunks should be run sequentially. Sometimes this isn’t true, and we need some code chunks to depend on other, specific code chunks. In this case, we can use the dependson option in our code chunk.

In the Caching Data tab, we compared StarWars and StarWarsSmall datasets using inspect_mem() in a code chunk named inspect_mem-StarWars-StarWarsSmall. Running this code is only possible after running the code in the StarWars chunk.

We can make the inspect_mem-StarWars-StarWarsSmall dependent on StarWars by adding dependson and the code chunk name.

```{r inspect_mem-StarWars-StarWarsSmall, dependson = "StarWars"}
inspectdf::inspect_mem(df1 = StarWars, df2 = StarWarsSmall) %>%
  inspectdf::show_plot(text_labels = TRUE, col_palette = 1)
```

Now the inspect_mem-StarWars-StarWarsSmall will only execute after the StarWarsSmall chunk has been run.

Figures

Graphs and figures are great tools for communicating results, and we want to keep track of all the visualizations we create in our report. R Markdown comes with multiple options for controlling the size, location, and quality of images in our reports.

Figure Size

We can adjust the size of our figures with fig.height= or fig.width=. These both take numeric values, and control the dimensions of the figure in inches. We can also control the size with out.width= and out.height=.

Below we visualize the average BMI by species and gender in the Star Wars universe. We also load the hrbrthemes package to give us more control over the aesthetics in our plot.

```{r gg_avg_bmi_spec_gend, fig.height=5.5, fig.width=8, out.width='100%', out.height='100%'}
library(hrbrthemes)
StarWars %>%
  dplyr::filter(!is.na(mass) & !is.na(height) & !is.na(species)) %>%
  dplyr::mutate(bmi = mass / ((height / 100)  ^ 2)) %>%
  dplyr::group_by(species, gender) %>%
  dplyr::summarize(mean_bmi = mean(bmi, na.rm = TRUE)) %>%
  dplyr::ungroup() %>%
  dplyr::arrange(desc(mean_bmi)) %>%
  dplyr::mutate(species = reorder(species, mean_bmi)) %>%
  ggplot2::ggplot(aes(x = mean_bmi, y = species,
                      color = as.factor(species),
                      group = gender)) +
  ggplot2::geom_point(show.legend = FALSE) +
  ggplot2::facet_wrap(. ~ gender, scales = "free") +
  ggplot2::labs(title = "Average BMI in Star Wars Universe",
                subtitle = "Grouped by species and gender",
                caption = "source = https://swapi.dev/",
                x = "Mean BMI", y = "Species") +
  hrbrthemes::theme_ipsum_rc(axis_text_size = 9,
                             axis_title_size = 13,
                             strip_text_size = 13) -> gg_avg_bmi_spec_gend
gg_avg_bmi_spec_gend
```

We can see this figure fits the page well because we are able to control the size of the height and width.

Figure Location

Now that we’ve created a few figures, we can see how these get stored to be used in the final .html file. Much like the default cache settings, when we create graphs in R Markdown, a default folder is created that is your-file-name + _files, and a subfolder figure-html contains the images for the document.

We can see the figure we created above gg_avg_bmi_spec_gend in the figure-html folder below:

rmarkdown-in-practice_files/figure-html
└── gg_avg_bmi_spec_gend-1.png

We can also manually specify where we want the figures saved with fig.path=. If we’re setting a folder for the figures, we can do it in the code chunk,

```{r figure-title, fig.path="img/"}
# code to create figure...
```

Or in the setup chunk (but we need to make sure the folder exists!)

```{r setup, include=FALSE}
# create image folder
fs::dir_create(path = "img/")
knitr::opts_chunk$set(fig.path = "img/")
```

Most graphs also have options for saving, which we will demonstrate using the dm and starwarsdb packages to show how the Star Wars data are related to one another.

The starwarsdb package comes with a data model function (starwars_dm()), which we will pass to dm_draw() from the dm package. dm stands for ‘data model’, and this package is great for visualizing relational data

```{r StarWarsDataModel}
library(dm)
library(starwarsdb)
StarWarsDataModel <- dm_draw(dm = starwars_dm(),
                             graph_name = "StarWarsDataModel")
StarWarsDataModel
```

We can see the individual data tables, and which keys link them together.

This graph requires some additional steps to save as a .png, but we can see we’re allowed to specify the file and folder path in the rsvg::rsvg_png() function.

# packages to export
library(DiagrammeR)
library(DiagrammeRsvg)
library(rsvg)
# export file
StarWarsDataModel %>%
  DiagrammeRsvg::export_svg(gv = .) %>%
  base::charToRaw() %>%
  rsvg::rsvg_png(svg = ., 
                 height = 1440,
                 file = "../img/StarWarsDataModel.png")

Interactive Figures

The biggest benefit to using HTML is the ability to create interactive graphs. One example comes from the plotly package.

We can easily convert a ggplot2 graph to plotly using the toWebGL() and ggplotly() functions. We also remove the legend with plotly::hide_legend() so the plot looks identical to the version above.

library(plotly)
plotly::toWebGL(plotly::ggplotly(gg_avg_bmi_spec_gend)) %>%
  # remove legend
  plotly::hide_legend(p = .)